A dataset containing 1924 observations was used in this study to evaluate the
effect of 435 independent variables on one dependent variable. Big data raises
issues such as irrelevant variables and outliers. This study therefore
analyzed and compared three machine learning-based variable selection
techniques: random forest (RF), support vector machines (SVM), and boosting.
Further, M robust regression was applied to address the outliers using
M-bi square, M-Hampel, and M-Huber. The combination of random forest and
M-Hampel gave better results than the other methods: mean absolute
error (MAE) 175.33995, mean square error (MSE) 31.8608, mean absolute
percentage error (MAPE) 9.16091, sum of square error (SSE) 89270.45,
R-square 0.829511, and adjusted R-square 0.82670. This combination also
gave lower values of the 8 selection criteria than the other techniques:
Akaike information criterion (AIC) 47.25915, generalized cross
validation (GCV) 47.27169, Hannan-Quinn (HQ) 47.60351, RICE
47.2845, SCHWARZ 51.7099, sigma square (SGMASQ) 46.50605,
SHIBATA 47.23489, and final prediction error (FPE) 47.25929. The study
therefore recommends the random forest and M-Hampel model as the one that
shows the fewest issues and the most efficient validation when analyzing
and comparing big data.
1. INTRODUCTION


Agriculture is a key aspect of food security, and its problems have been severe in several regions of the
world. Agriculture contributes to poverty alleviation by lowering food prices, creating
employment, enhancing farm incomes, and raising wages. Food security concerns the capacity of people to
produce sufficient food every day, provide nutrition, and minimize environmental impact [1]. Food
production has increased in response to demand as the world population has grown [2].
More food is needed as the population increases to fulfil the demands of developing countries [3].
Dealing with agricultural problems requires effective techniques to address the challenges of sustainable
agriculture and food security.

Big data technologies are used in sustainable agriculture and food security to evaluate these issues, and
they are expected to become more widespread in the future. They are applied in agriculture to improve the
accuracy and performance of sustainable agriculture and food security [4]. Big data combines
many observations with a large number of variables [5], [6].

However, using too many variables in a regression model becomes a problem, especially if some of them are
irrelevant. Irrelevant variables introduce noise and negatively influence the regression model [7].
They yield models with higher variance and bias. In addition, they can significantly affect the
statistical analysis and lead to overfitting. To address
these issues, we propose machine learning as a variable selection method. Machine learning and big data
have also been applied to solve problems in agricultural activity [8].

Machine learning can address the issue of irrelevant variables by ranking the variables by
significance. The most important variables are the highest-ranked independent variables in terms of their
contribution to the dependent variable [9]. Variable selection remains a significant and largely unexplored
challenge in big data. Its result is a measure of variable importance [10].

Variable importance represents the relevance (significance) of each variable in the
data with respect to its effect on the generated model. It is used to select a suitable
variable subset from the original variables, and it has been shown to improve
regression accuracy and reduce the complexity of the learning model [11]. Machine learning has two
main groups: supervised and unsupervised learning. Supervised learning is further divided into two
sub-groups: classification and regression. The dependent variable is discrete in classification and
continuous in regression.

Regression analysis is used in this study to address the relationship between
independent and dependent variables. Regression in big data continues to receive significant
attention. However, regression in big data still has open problems, such as irrelevant variables [12] and
outliers [13]. Numerous machine learning approaches have been suggested to handle regression in big data,
including unsupervised, reinforcement, supervised, and semi-supervised learning in different areas [14].
This study uses supervised machine learning techniques for variable selection, namely random forest (RF),
support vector machines (SVM), and boosting. The subset of the 30 most influential variables is taken
from each technique.

The second issue in big data regression is outliers. An outlier is a data point that differs
significantly from the surrounding points [15]. Outliers can occur for various reasons, such as human error,
mechanical error, and instrument error [16]. The presence of outliers leads to an inaccurate and
incorrectly identified regression model. Furthermore, an incorrectly estimated model can produce erroneous
results, significantly influence the mean and standard deviation, and lead to either over- or
underestimated values [17].

Ordinary least squares regression can be very sensitive to outliers, which can strongly distort the
estimates and give unreliable results. Several methods that are robust to outliers have been proposed in
the statistical literature [18], [19]. M-robust regressions can handle outliers [20]. They include
M-bi square (Tukey), M-Hampel, and M-Huber.

The objective of this study was to address the problems of both irrelevant variables and
outliers using a hybrid model that combines machine learning and M-robust regression techniques.
Specifically, the study examined the impact on variable selection of three
machine learning techniques: random forest, support vector machine, and boosting. In addition,
outliers were addressed using M-robust regression techniques: M-bi square, M-Hampel,
and M-Huber.


2. METHODS 
2.1. Regression learning 

Regression is a challenging issue in various fields of knowledge. This study investigates the
performance of regression in big data. Regression is a method used to build a predictive model [20].
Regression analysis is a supervised machine learning technique for building a model and evaluating its
performance for a continuous response based on the relationship among several variables. Regression is one
of the main tasks in machine learning and has been successfully applied in many areas, such as sustainable
agriculture and food security [21]. Multiple regression constructs and assesses the model based on the
relationship between the independent and dependent variables [22]. Given training data on the independent
variables $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and a dependent variable $y_i$, for $n$
observations and $p$ variables, $\{(x_i, y_i)\}_{i=1}^{n}$, the main purposes of these methods are to:
i) determine the causal relationship between the independent variables $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$
and the dependent variable $y = (y_1, y_2, \ldots, y_n)$; ii) predict $y_i$ based on a set of variables
$x_{i1}, x_{i2}, \ldots, x_{ip}$; and iii) screen $x_{i1}, x_{i2}, \ldots, x_{ip}$ to select the variables
that have a more significant effect than others in describing the dependent variable $y_i$. Regression
learning tasks can be stated as learning a function $g: x \to y$ from a learning set $\mathcal{L} = (x, y)$.
The purpose of regression learning is to find a model such that its prediction $g(x)$, denoted by
$\hat{y}$, is as good as possible, where $\hat{y}$ is continuous [23].


2.2. Random forest 

Random forest is used for classification and regression. The prediction is based on the majority vote of
the predicted values in classification and on their average in regression [24]. Random forest is a
tree-based ensemble learning method. It draws a bootstrapped sample for each tree of the ensemble and
considers a random subset of candidate variables at every node when a tree is built. The candidate
variables are a random subset of $m$ independent variables out of the $p$ independent variables, where
$m$ is a parameter controlled by the user. In the original article, $m = \log_2(p + 1)$ [25]. Later,
researchers applied the defaults $m \approx \sqrt{p}$ for classification and $m \approx p/3$ for
regression [26].


Algorithm: random forest, following Breiman's algorithm [26].

Given D: dataset with n observations, p independent variables, and one dependent variable.

Procedure:

For b = 1 to B, where B is the number of trees:

- Generate a bootstrapped sample D_b from the training set D.

- Grow a tree on the bootstrapped sample D_b using m candidate variables.
For a given node: i) randomly choose m variables, ii) find the best split variable and value, and iii) split the node using the best
split variable and value.

Repeat i) to iii) until the stopping rules are met.
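The two random ingredients of the procedure above, the bootstrapped sample and the random subset of m candidate variables, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used in the study: the function names `mtry_defaults`, `bootstrap_indices`, and `candidate_variables` are hypothetical, rounding each default down is an assumption, and the tree-growing step itself is omitted.

```python
import math
import random


def mtry_defaults(p):
    """Return common choices of m (candidate variables per node) for p predictors.

    Rounding down is an assumption made here for illustration.
    """
    return {
        "original": int(math.log2(p + 1)),   # m = log2(p + 1)
        "classification": math.isqrt(p),     # m ~ sqrt(p)
        "regression": p // 3,                # m ~ p / 3
    }


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrapped sample D_b)."""
    return [rng.randrange(n) for _ in range(n)]


def candidate_variables(p, m, rng):
    """Pick m distinct variable indices out of p to consider at one node."""
    return rng.sample(range(p), m)


rng = random.Random(0)                 # fixed seed so the sketch is reproducible
p = 435                                # number of interaction variables in this study
m = mtry_defaults(p)["regression"]
rows = bootstrap_indices(1924, rng)    # one bootstrapped sample of the 1924 observations
cols = candidate_variables(p, m, rng)  # candidate variables for a single node
```

For p = 435 the three rules give m = 8, 20, and 145 respectively, which shows how strongly the choice of default changes the number of variables examined at each split.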


Random forest has several advantages: i) low computational complexity, O(n log(n)); ii) robustness in
handling unbalanced datasets; and iii) embedded variable selection that ranks variables by importance [24].
One issue is that the variable importance measure is biased toward correlated independent variables [27].


2.3. Boosting 

Boosting aims to improve a model's accuracy. It is based on the idea that finding and averaging many rough
rules of thumb is easier than finding a single, highly accurate learner [28]. Boosting is applied to the
training data in a step-by-step manner, repeatedly applying the base method while placing more emphasis on
particular observations [29].

Consider the data $\{x_i, y_i\}_{i=1}^{N}$ of known $(x, y)$ values. Boosting aims to obtain an
approximation $\hat{F}(x)$ of the function $F^*(x)$ mapping $x$ to $y$ that minimizes the expected value of
a loss function $L(y, F(x))$, such as the squared error $(y - F)^2$ for $y \in \mathbb{R}$ [30].


Algorithm: boosting

Given: $(x_1, y_1), \ldots, (x_m, y_m)$, where $x_i \in X$, $y_i \in Y$.

Initialize $D_1(i) = 1/m$.

For $t = 1, \ldots, T$:

- Train the base learner using distribution $D_t$.
- Get the base regression $f_t : X \to \mathbb{R}$.
- Choose $\alpha_t \in \mathbb{R}$.
- Update

$$D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i f_t(x_i))}{Z_t}$$

where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution). Output the final regression:

$$F^* = \arg\min_F E_{y,x} L(y, F(x)) = \arg\min_F E_x\left[E_y\big(L(y, F(x))\big) \mid x\right]$$
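The distribution update in the algorithm above can be illustrated with a single boosting round. The sketch assumes targets and base-learner outputs in {-1, +1}, so that the product y_i f_t(x_i) is +1 for a correct prediction and -1 for an incorrect one; the function name `update_distribution`, the value of alpha, and the sample values are hypothetical.

```python
import math


def update_distribution(weights, alpha, y, f):
    """One boosting round: D_{t+1}(i) = D_t(i) * exp(-alpha * y_i * f_i) / Z_t."""
    unnormalized = [
        w * math.exp(-alpha * yi * fi) for w, yi, fi in zip(weights, y, f)
    ]
    z = sum(unnormalized)                  # normalization factor Z_t
    return [u / z for u in unnormalized]


m = 4
d1 = [1.0 / m] * m                         # initialize D_1(i) = 1/m
y = [1, -1, 1, -1]                         # hypothetical targets
f = [1, -1, -1, -1]                        # hypothetical base-learner output; third one is wrong
d2 = update_distribution(d1, alpha=0.5, y=y, f=f)
```

After the update the poorly predicted third observation carries more weight than the others, which is how the next base learner is made to concentrate on it.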


2.4. Support vector machines-regression 

Support vector machines (SVM) were developed by Vapnik as a learning system based on
structural risk minimization (SRM) [31]. The traditional empirical risk minimization (ERM) principle
minimizes the error on the training data, while SRM minimizes an upper bound on the expected risk. As a
result, the SVM generalizes more accurately. The SVM was originally applied to classification problems
[32]. However, it can also be applied to regression problems by means of a loss function.

The $\varepsilon$-insensitive loss function is frequently applied for regression purposes [33]. Prediction
errors smaller than $\varepsilon$ are ignored [32]. In most cases, $\varepsilon$ is zero or a small positive
number such as 0.001. Support vector machines use the $\varepsilon$-insensitive loss function
$|y - f(x)|_\varepsilon = \max\{0, |y - f(x)| - \varepsilon\}$, where $\varepsilon > 0$ creates a tube
around the true output.
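The loss described above is short enough to write out directly. The following is a minimal sketch; the function name `eps_insensitive_loss` is hypothetical and the default of 0.001 is taken from the example value mentioned in the text.

```python
def eps_insensitive_loss(y, f_x, eps=0.001):
    """Return max(0, |y - f(x)| - eps): errors inside the tube cost nothing."""
    return max(0.0, abs(y - f_x) - eps)


inside_tube = eps_insensitive_loss(1.0, 1.05, eps=0.1)   # error 0.05 is ignored
outside_tube = eps_insensitive_loss(1.0, 1.5, eps=0.1)   # error 0.5 costs 0.5 - 0.1
```

Only the part of the error that exceeds the tube width contributes to the loss, so small deviations do not influence the fitted function.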

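The primal problem becomes:

$$\tau(w, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*)$$ (1)

subject to

$$(\langle \Phi(x_i), w \rangle + b) - y_i \leq \varepsilon + \xi_i$$ (2)
$$y_i - (\langle \Phi(x_i), w \rangle + b) \leq \varepsilon + \xi_i^*$$
$$\xi_i, \xi_i^* \geq 0 \quad (i = 1, \ldots, m)$$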


The accuracy of support vector regression can be estimated by computing the scale
parameter of a Laplacian distribution fitted to the residuals. Here $f(x)$ is the estimated decision function [34], [35].


2.5. M robust-regression 

In regression analysis, outliers in a dataset can distort the least squares estimator and
produce unreliable results. Numerous methods that are robust to outliers have been proposed in the
statistical literature to address this issue. Traditional methods usually lose efficiency in estimating
the population parameters because they are sensitive to outliers [36].

Therefore, the present study adopts robust regression techniques based on
M-estimation, namely M-Tukey bi square, M-Hampel, and M-Huber. The M in M-estimation stands for
"maximum likelihood-type". We consider only the linear model:


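$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i$$ (3)
$$y_i = x_i^T \beta + \varepsilon_i$$ (4)


for the $i$-th of $n$ independent observations. We assume that the model itself is not at issue, so that
$E(y \mid x) = x_i^T \beta$, but the distribution of the errors may be heavy-tailed, producing occasional
outliers [37]. Given an estimator $b$ for $\beta$, the fitted model is


$$\hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_p x_{ip} = x_i^T b$$ (5)

and the residuals are given by

$$e_i = y_i - \hat{y}_i = y_i - x_i^T b$$ (6)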


With M-estimation, including M-bi square, M-Hampel, and M-Huber, the estimate $b$ is determined by
minimizing a particular objective function over all $b$.

The function $\rho$ gives the contribution of each residual to the objective function [38]. The function $\rho$ requires the following properties:

- Non-negative, $\rho(e) \geq 0$

- Equal to zero when its argument is zero, $\rho(0) = 0$

- Symmetric, $\rho(e) = \rho(-e)$

- Monotone in $|e_i|$, $\rho(e_i) \geq \rho(e_{i'})$ for $|e_i| > |e_{i'}|$

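The M-estimation principle is to minimize the sum of the residual function $\rho$:


$$\hat{\beta}_M = \min_{b} \sum_{i=1}^{n} \rho(y_i - x_i^T b)$$ (7)


To solve (7), the residuals are standardized by a scale estimate $\hat{\sigma}$:


$$\min_{b} \sum_{i=1}^{n} \rho\left(\frac{e_i}{\hat{\sigma}}\right) = \min_{b} \sum_{i=1}^{n} \rho\left(\frac{y_i - x_i^T b}{\hat{\sigma}}\right)$$ (8)


The scale estimate $\hat{\sigma}$ is given in (9):


$$\hat{\sigma} = \frac{MAD}{0.6745} = \frac{\mathrm{median}\,|e_i - \mathrm{median}(e_i)|}{0.6745}$$ (9)


The M-estimator of scale $\hat{\sigma}$ is found by solution of (10):


$$\frac{1}{n} \sum_{i=1}^{n} \rho\left(\frac{e_i}{\hat{\sigma}}\right) = \frac{1}{n} \sum_{i=1}^{n} \rho\left(\frac{y_i - x_i^T b}{\hat{\sigma}}\right) = k$$ (10)


Since $\beta$ is the $p \times 1$ parameter vector, the $\psi$-type system is obtained as (11):


$$\sum_{i=1}^{n} \psi\left(\frac{y_i - x_i^T b}{\hat{\sigma}}\right) x_{ij} = 0, \quad \text{for } j = 1, 2, \ldots, p$$ (11)

The derivative $\psi(e) = \partial\rho / \partial e$ is the influence function. The weight function can then be defined as (12):


$$w(e) = \frac{\psi(e)}{e}$$ (12)

With this weight function, the system (11) becomes:


$$\sum_{i=1}^{n} w\left(\frac{e_i}{\hat{\sigma}}\right) e_i x_{ij} = 0, \quad \text{for } j = 1, 2, \ldots, p$$ (13)

and the objective becomes the following iterated re-weighted least squares problem:

$$\min_{b} \sum_{i=1}^{n} w\left(\frac{e_i^{(k-1)}}{\hat{\sigma}}\right) e_i^2$$ (14)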


Here k denotes the iteration number. M robust regression was used to reduce the influence of outliers
using M-bi square, M-Hampel, and M-Huber [39]; their objective and weight functions are shown in Table 1.
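The iterated re-weighted least squares scheme in (14), together with the MAD-based scale in (9), can be sketched for the simplest case of one predictor. This is an illustration under stated assumptions rather than the procedure used in the study: it uses only Huber weights, the tuning constant k = 1.345 is a commonly quoted default and not a value from the paper, the data are synthetic, and the function names are hypothetical.

```python
import statistics


def huber_weight(u, k=1.345):
    """Huber weight for a standardized residual u: 1 inside [-k, k], k/|u| outside."""
    return 1.0 if abs(u) <= k else k / abs(u)


def mad_scale(residuals):
    """Robust scale estimate: median|e_i - median(e)| / 0.6745."""
    med = statistics.median(residuals)
    return statistics.median([abs(e - med) for e in residuals]) / 0.6745


def weighted_line_fit(x, y, w):
    """Closed-form weighted least squares for y = b0 + b1 * x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def irls_huber(x, y, iterations=50, tol=1e-10):
    """Iteratively re-weighted least squares with Huber weights."""
    b0, b1 = weighted_line_fit(x, y, [1.0] * len(x))   # start from ordinary least squares
    for _ in range(iterations):
        e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        s = mad_scale(e)
        if s == 0:                                     # degenerate scale; keep current fit
            break
        w = [huber_weight(ei / s) for ei in e]         # weights from the previous residuals
        new_b0, new_b1 = weighted_line_fit(x, y, w)
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


# Synthetic data on the line y = 2 + 3x with one gross outlier at the last point.
x = [float(i) for i in range(10)]
y = [2.0 + 3.0 * xi for xi in x]
y[9] = 100.0
ols = weighted_line_fit(x, y, [1.0] * len(x))
robust = irls_huber(x, y)
```

On this toy data the ordinary least squares slope is pulled far from 3 by the single outlier, while the re-weighted fit returns to the underlying line because the outlier's weight shrinks at every iteration.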


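Table 1. Formulas for robust regression M-estimation


Methods     Objective function / Weight function
Bi-Square   $\rho_B(e) = \begin{cases} \frac{k^2}{6}\left\{1 - \left[1 - \left(\frac{e}{k}\right)^2\right]^3\right\} & \text{for } |e| \leq k \\ \frac{k^2}{6} & \text{for } |e| > k \end{cases}$
            $w_B(e) = \begin{cases} \left[1 - \left(\frac{e}{k}\right)^2\right]^2 & \text{for } |e| \leq k \\ 0 & \text{for } |e| > k \end{cases}$
Huber       $\rho_H(e) = \begin{cases} \frac{1}{2}e^2 & \text{for } |e| \leq k \\ k|e| - \frac{1}{2}k^2 & \text{for } |e| > k \end{cases}$
            $w_H(e) = \begin{cases} 1 & \text{for } |e| \leq k \\ \frac{k}{|e|} & \text{for } |e| > k \end{cases}$
Hampel      $\rho_{Ha}(e) = \begin{cases} \frac{1}{2}e^2 & \text{for } 0 \leq |e| \leq a \\ a|e| - \frac{1}{2}a^2 & \text{for } a < |e| \leq b \\ \frac{a}{2}(b + c - a) - \frac{a(c - |e|)^2}{2(c - b)} & \text{for } b < |e| \leq c \\ \frac{a}{2}(b + c - a) & \text{for } |e| > c \end{cases}$
            $w_{Ha}(e) = \begin{cases} 1 & \text{for } 0 \leq |e| \leq a \\ \frac{a}{|e|} & \text{for } a < |e| \leq b \\ \frac{a(c - |e|)}{|e|(c - b)} & \text{for } b < |e| \leq c \\ 0 & \text{for } |e| > c \end{cases}$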
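The three weight functions in Table 1 can be compared directly in code. This is a hedged sketch: the residual e is assumed to be already standardized by the scale estimate, and the tuning constants (k = 1.345 for Huber, k = 4.685 for bi-square, and a = 2, b = 4, c = 8 for Hampel) are commonly used defaults assumed here for illustration, not values reported in this study.

```python
def huber_weight(e, k=1.345):
    """w(e) = 1 for |e| <= k, k/|e| otherwise (never reaches zero)."""
    return 1.0 if abs(e) <= k else k / abs(e)


def bisquare_weight(e, k=4.685):
    """w(e) = [1 - (e/k)^2]^2 for |e| <= k, 0 otherwise."""
    return (1.0 - (e / k) ** 2) ** 2 if abs(e) <= k else 0.0


def hampel_weight(e, a=2.0, b=4.0, c=8.0):
    """Three-part redescending weight with breakpoints a < b < c."""
    r = abs(e)
    if r <= a:
        return 1.0                            # small residuals keep full weight
    if r <= b:
        return a / r                          # Huber-like decay
    if r <= c:
        return a * (c - r) / (r * (c - b))    # linear descent of psi to zero
    return 0.0                                # gross outliers are rejected


# Weights given to residuals of increasing size by each function.
residuals = [0.5, 3.0, 6.0, 10.0]
table = {
    "huber": [huber_weight(e) for e in residuals],
    "bisquare": [bisquare_weight(e) for e in residuals],
    "hampel": [hampel_weight(e) for e in residuals],
}
```

The comparison shows the qualitative difference between the methods: Huber weights decay but never reach zero, whereas bi-square and Hampel weights drop to exactly zero for sufficiently large residuals.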


2.6. Selection models 
2.6.1. Phase 1 — all possible models 

For this study, only second-order interactions of the variables are considered. The number of all possible
models, N, is given by (15), where k is the total number of independent variables and j = 1, 2, ..., k.


$$N = \sum_{j=1}^{k} j \binom{k}{j}$$ (15)


A dataset containing 1924 observations is used to study the effect of 29 independent variables on
one dependent variable. The variables are then combined into second-order interactions, so that the data
contain the effect of 435 interaction independent variables on the dependent variable.
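One reading that reproduces the count of 435 is that the 29 single variables are kept alongside all of their pairwise interactions. The sketch below checks that arithmetic; the placeholder names X1 to X29 are hypothetical, and treating an interaction as an unordered pair written "A*B" is an assumption.

```python
from itertools import combinations
from math import comb

main_effects = [f"X{i}" for i in range(1, 30)]           # 29 placeholder variable names
interactions = [f"{a}*{b}" for a, b in combinations(main_effects, 2)]
n_terms = len(main_effects) + len(interactions)          # single variables plus pairs
```

Under this assumption there are 406 pairwise interactions, and adding the 29 single variables gives 435 candidate terms.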


2.6.2. Phase 2 — selected models 

In this paper, we analyze machine learning as a variable selection method, including random forest,
support vector machine, and boosting. We take the subset of the 30 most influential variables from each
technique and apply three M robust regressions: M-bi square, M-Hampel, and M-Huber.
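Taking the most influential variables from a ranking amounts to sorting by importance and keeping the first entries. A minimal sketch follows; the function name `top_k_variables` and the importance scores are hypothetical and are not results from this study.

```python
def top_k_variables(importance, k=30):
    """Return the k variable names with the largest importance scores."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical importance scores, for illustration only.
scores = {"T8": 0.31, "T2*T6": 0.12, "H1*PY": 0.27, "T9": 0.05}
selected = top_k_variables(scores, k=2)
```

The selected names would then define the columns passed to each of the three M robust regressions.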
2.6.3. Phase 3 — the best model

The next step was to obtain the best model from the list of selected models. Eight selection
criteria (8SC) were defined for this purpose by [40]. The 8SC formulas are shown in
Table 2: Akaike information criterion (AIC), RICE, final prediction
error (FPE), SCHWARZ, generalized cross validation (GCV), sigma square (SGMASQ), SHIBATA, and
Hannan-Quinn (HQ). The best model is chosen on the basis of the minimum value obtained for all of these criteria.


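Table 2. Formula used for 8SC


No  Methods                       Formulation                                                   Reference
1.  AIC                           $\left(\frac{SSE}{n}\right) e^{2(k+1)/n}$                      [41]
2.  RICE                          $\left(\frac{SSE}{n}\right)\left[1 - \frac{2(k+1)}{n}\right]^{-1}$   [42]
3.  Final prediction error (FPE)  $\left(\frac{SSE}{n}\right)\frac{n + (k+1)}{n - (k+1)}$        [41]
4.  SCHWARZ                       $\left(\frac{SSE}{n}\right) n^{(k+1)/n}$                       [43]
5.  GCV                           $\left(\frac{SSE}{n}\right)\left[1 - \frac{k+1}{n}\right]^{-2}$      [44]
6.  SGMASQ                        $\left(\frac{SSE}{n}\right)\left[1 - \frac{k+1}{n}\right]^{-1}$      [44]
7.  SHIBATA                       $\left(\frac{SSE}{n}\right)\frac{n + 2(k+1)}{n}$               [45]
8.  HQ                            $\left(\frac{SSE}{n}\right)(\ln n)^{2(k+1)/n}$                 [46]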
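The eight criteria in Table 2 share the factor SSE/n and differ only in the penalty applied for model size, which makes them easy to compute together. The sketch below uses the forms in which these criteria are commonly stated and assumes that (k + 1) is the number of estimated parameters; the function names, the tie-breaking rule (the model that is lowest on the most criteria), and the example values are hypothetical.

```python
import math


def eight_selection_criteria(sse, n, k):
    """Compute the 8SC for a model with k regressors plus an intercept."""
    q = k + 1                      # number of estimated parameters (assumption)
    base = sse / n
    return {
        "AIC": base * math.exp(2 * q / n),
        "RICE": base / (1 - 2 * q / n),
        "FPE": base * (n + q) / (n - q),
        "SCHWARZ": base * n ** (q / n),
        "GCV": base / (1 - q / n) ** 2,
        "SGMASQ": base / (1 - q / n),
        "SHIBATA": base * (n + 2 * q) / n,
        "HQ": base * math.log(n) ** (2 * q / n),
    }


def best_by_8sc(scores):
    """Return the model that attains the minimum on the most criteria."""
    wins = {name: 0 for name in scores}
    for criterion in next(iter(scores.values())):
        winner = min(scores, key=lambda name: scores[name][criterion])
        wins[winner] += 1
    return max(wins, key=wins.get)


# Hypothetical (SSE, n, k) triples for two candidate models.
models = {"A": (120.0, 50, 4), "B": (100.0, 50, 6)}
scores = {name: eight_selection_criteria(*args) for name, args in models.items()}
best = best_by_8sc(scores)
```

Each criterion inflates SSE/n by a different penalty, so a model with more variables is preferred only if its reduction in SSE outweighs that penalty.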


2.7. Validation models 

The validation measures, including mean absolute error (MAE), mean square error (MSE), mean absolute
percentage error (MAPE), sum of square error (SSE), R-square, and adjusted R-square, are used to evaluate
model performance. SSE measures the discrepancy between the data and the estimated model. Generally, a
lower SSE indicates a model that explains the data better, and a higher SSE indicates a model that
describes the data poorly [47]. The R-square value is a common way to express the goodness of fit of a
regression model. It gives the percentage of the variation that is explained by the independent variables;
R-square = 1 means that the model fits the real data exactly [48]. MSE measures the
average of the squared deviations between the fitted values and the actual observations [49]. The
validation formulas are shown in Table 3.


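Table 3. Formulas for validation methods


No  Validation                             Formulation                                                              Reference
1   Sum of square error                    $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$                                [47]
2   Sum of square total                    $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$                                  [47]
3   R-squared                              $R^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}$     [48]
4   Mean absolute error (MAE)              $MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$                      [50]
5   Mean square error (MSE)                $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$                    [50]
6   Mean absolute percentage error (MAPE)  $MAPE = \frac{1}{n} \sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100$   [51]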
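The measures in Table 3 can be computed from the observed and fitted values alone. The sketch below uses the standard definitions, with MAPE expressed in percent; the function name `validation_metrics` and the four data points are hypothetical, and the observed values are assumed to be non-zero so that MAPE is defined.

```python
def validation_metrics(y, y_hat):
    """Return SSE, SST, R-square, MAE, MSE, and MAPE (in percent)."""
    n = len(y)
    errors = [yi - fi for yi, fi in zip(y, y_hat)]
    y_bar = sum(y) / n
    sse = sum(e * e for e in errors)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return {
        "SSE": sse,
        "SST": sst,
        "R2": 1 - sse / sst,
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": sse / n,
        "MAPE": 100 * sum(abs(e / yi) for e, yi in zip(errors, y)) / n,
    }


# Hypothetical observed and fitted values.
observed = [2.0, 4.0, 6.0, 8.0]
fitted = [2.5, 3.5, 6.5, 7.5]
metrics = validation_metrics(observed, fitted)
```

For these four points every fitted value is off by 0.5, so SSE is 1.0, MAE is 0.5, MSE is 0.25, and R-square is 0.95, while MAPE is larger for the small observed values because the same absolute error is a bigger share of them.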


3. RESULTS AND DISCUSSION 
3.1. Data 

The data were collected between 8:00 am and 5:00 pm from 08/04/2017 to
12/04/2017, that is, almost four days of data. The original data were recorded every second and were then
converted to hours for the analysis. The variables are hourly solar radiation, temperature, humidity, and
moisture content. The modelling factors are detailed in Table 4.

In this paper, a dataset containing 1924 observations is used to study the effect of 29
independent variables on one dependent variable. The significance of interaction terms was also
examined in this study. For example, T1*T2 represents the interaction between T1 and T2, and H1*PY
represents the interaction between H1 and PY. In total, the data contain the effect of 435 interaction
independent variables on the one dependent variable.
Table 4. Factors of modelling


Symbols                       Factors      Definitions
Y                             Dependent    Moisture
H1                            Independent  Relative humidity, ambient
H5                            Independent  Relative humidity, chamber
PY                            Independent  Solar radiation
T1                            Independent  Temperature (°C), ambient
T2, T3, T4                    Independent  Temperature (°C) before entering solar collector
T5                            Independent  Temperature (°C) in front of down v-groove (solar collector)
T6, T8                        Independent  Temperature (°C) in front of up v-groove (solar collector)
T7, T14, T15, T16, T21, T22   Independent  Temperature (°C), solar collector
T9, T10, T11, T12             Independent  Temperature (°C) behind (inside chamber)
T13, T17, T18, T19, T23       Independent  Temperature (°C) in front of (inside chamber)
T20, T23, T24, T25, T28       Independent  Temperature (°C) from solar collector to chamber


3.2. Result 

The validation metrics are the sum of square error (SSE), mean absolute error (MAE), mean square
error (MSE), mean absolute percentage error (MAPE), and R-square. They are used to compare the three
machine learning-based variable selection techniques combined with the M-robust regression algorithms, and
the best model is identified by the eight selection criteria (8SC). The three variable selection techniques
used in this study are random forest, support vector machines, and boosting.

Table 5 shows the variables ranked by importance under each method, that is, the subset of the 30 most
important variables selected by each technique. To measure prediction
accuracy, the predicted responses of each regression-based model are compared with the actual responses
using the validation methods; the results are given in Table 6.


Table 5. The 30 highest of variable importance
No  Methods                  Variable Importance
1   Random Forest            T8, H1*PY, T8*H5, T8*H1, T2*T6, T2*T7, T1*T6, T21*H5, T7*H1, H5*PY, T9, T19*H5, T7, H1*H5,
                             T7*PY, T22*H5, T6*T8, T10*H5, T11*H5, T6*T13, T12*T13, T9*H5, H1, T7*T9, T6*T9, T8*T23,
                             T23*H1, T3*T8, T4*T8
2   Support Vector Machines  T1*T6, T2*T6, T17*H1, T6*T13, T19*H1, T22*H1, T17*PY, T1*T2, T27*PY, T28*PY, T21*H1,
                             T27*H1, T9*PY, T22*PY, T10*PY, T5*PY, T19*PY, T21*PY, T1*PY, T6*T29, T2*PY, T2*T13,
                             T29*PY, T13*PY, T23*PY, T28*H1, T12*PY, T14*PY, T26*H1, T11*PY
3   Boosting                 T2*T6, T1*T6, H5*PY, T7*H1, T5*PY, T21*H5, T8*PY, T7*T9, T8, T2*T7, T6*T13, T19*H5, T26*H1,
                             T7*PY, T10*PY, T2*T9, T8*T29, T11*H5, T17*H5, T6*T9, T9, T1*T2, T1*T9, T12*PY, T7, T26*H5,
                             T12*H5, T9*T29, T25*H5, and T1*T8


Table 6 shows the validation metrics; the random forest-Hampel model exhibited the lowest errors.
This suggests that random forest-Hampel is a method that can be relied on for
accuracy in big data analysis with machine learning and robust regression. Random forest-Hampel obtained
better results than the other models.


Table 6. Results for the validation model


Machine Learning         Robust Regression  MAE          MSE       MAPE      SSE       R-square
Random Forest            Bi-Square          235.16695    183.165   12.28667  238913.7  0.543723
Random Forest            Hampel             175.33995    31.8608   9.160917  87570.9   0.838757
Random Forest            Huber              221.3641     42.8569   11.56552  89270.45  0.829511
Support Vector Machines  Bi-Square          209.086525   63.4550   10.92406  191406.1  0.634453
Support Vector Machines  Hampel             249.01216    57.1446   13.01004  134216.8  0.743673
Support Vector Machines  Huber              237.3297451  52.8000   12.39967  136270.1  0.739752
Boosting                 Bi-Square          281.774977   1837.10   14.72179  121532.1  0.767898
Boosting                 Hampel             184.06188    50.2921   9.616608  86894.18  0.834049
Boosting                 Huber              187.4855378  64.3844   9.795483  88406.59  0.831161


There are 9 possible models, formed by combining the machine learning techniques (random forest, support
vector machines, and boosting) with the M robust regressions (Tukey bi square, Hampel, and Huber). The
results for the 8 selection criteria are shown in Table 7. The minimum values of the 8 selection criteria
were found for the random forest-Hampel model, which is therefore the efficient model obtained in phase 3.
Table 7. Results for 8 selection criteria for machine learning-M robust regression


ML                       Robust Regression  AIC       GCV       HQ        RICE      SCHWARZ   SGMASQ    SHIBATA   FPE
Random Forest            Bi-Square          128.9339  128.9681  129.8734  129.0031  141.0766  126.8793  128.8677  128.9343
Random Forest            Hampel             46.90639  47.23564  46.9191   51.3103   46.14667  46.86987  46.89408  46.89395
Random Forest            Huber              48.17634  48.18912  48.52738  48.20218  52.71347  47.40863  48.15161  48.17648
Support Vector Machines  Bi-Square          103.2956  103.323   104.0483  103.351   113.0237  101.6495  103.2426  103.2959
Support Vector Machines  Hampel             72.43242  72.45163  72.9602   72.47127  79.25392  71.27817  72.39523  72.43263
Support Vector Machines  Huber              73.5405   73.56001  74.07638  73.57995  80.46636  72.3686   73.50274  73.54071
Boosting                 Bi-Square          49.73711  49.7503   66.06481  49.76379  54.42122  48.94452  49.71157  49.73725
Boosting                 Hampel             47.25915  47.27169  47.60351  47.2845   51.7099   46.50605  47.23489  47.25929
Boosting                 Huber              47.71015  47.7228   48.05779  47.13574  52.20336  46.94986  47.68565  47.71028


The random forest-Hampel model has the best validation results, with MAE (175.33995), MSE
(31.8608), MAPE (9.160917), SSE (87570.9), and R-square (0.838757). It also has the
lowest 8 selection criteria, with AIC (46.90639), GCV (47.23564), HQ (46.9191), RICE (51.3103),
SCHWARZ (46.14667), SGMASQ (46.86987), SHIBATA (46.89408), and FPE (46.89395). In short, we
conclude that random forest-Hampel generated the lowest errors and gives the most relevant
results in terms of both model validation and the 8 selection criteria.

MAE, MSE, MAPE, SSE, and R-square are widely used measures in model validation.
Random forest-Hampel has the lowest validation errors of all the models. MAE, MSE, and SSE describe
how well the regression model fits the data; in particular, they measure
the variation of the error between the predicted and actual data. Hence, random
forest-Hampel has the lowest bias and variation. The highest R-square (0.829511) is that of random
forest-Hampel. R-square measures the share of variation accounted for by the predicted data. This suggests
that 82.9511% of the variation in the dependent variable is explained by the independent variables.

Variable selection helps to build a useful regression model: it can increase accuracy and
reduce model complexity. It consists of selecting the variables that have the most significant
influence in the regression dataset [52]. Variable importance techniques have been developed
independently in many disciplines [53]. Variable selection results in a subset
of important variables, and variable importance ranks the variables according to an importance measure.
Variable selection attracts researchers who deal with machine learning.

Random forest aims to reduce dimensionality [54]. It is easy and fast to implement,
provides very accurate predictions, and can manage many variables without overfitting [55]. Random forest is
well suited to medium and big data and has good predictive performance in practice. Moreover,
it provides measures of the importance of the variables with respect to the prediction of the
outcome variable. Random forest is of interest in machine learning research on ensemble
learning for generating regression models. It is broadly accepted that its performance with
many variables is usually better than that of other methods [56]-[59].

M-robust regression is an analysis applied when there are outliers in the dataset [19]. In this
study, M-Hampel robust regression gave the lowest MAE, MSE, MAPE, and sum of
square error, and the highest R-square and adjusted R-square. M-Hampel robust regression
outperformed the others [60].

The random forest-Hampel model gives the lowest MAE, MSE, MAPE, and SSE, and the highest R-square.
Random forest has two useful characteristics: high prediction accuracy and information
on variable importance [61]. Sustainable agriculture and food security are regarded as two of the most
important economic sectors of Malaysia, and both have benefited strongly from the development of
machine learning and M-robust regression in recent years.

Machine learning and M-robust regression have emerged together with big data technologies and
high-performance computing to create new opportunities to unravel, quantify, and understand data-intensive
processes in agricultural operational environments [62], [63]. Machine learning and statistical learning
are applied in more and more scientific fields, such as sustainable agriculture and food security [64].

Machine learning and statistical learning are two core techniques for building precision agriculture
systems. Recently, mathematical modelling has been proposed to promote the modernization of agriculture
and to improve both sustainable agriculture and food security. Machine learning and statistical
learning are used to analyze agricultural data for smart decision-making [65]. As a result, sustainable
agriculture and food security become more reliable and capable, which helps to boost productivity [66].
Random forest has been shown to be a reliable and accurate model for predicting paddy with very high
accuracy, in support of sustainable agriculture and food security [67]. The random forest-Hampel model
provides the most relevant results for application to sustainable agriculture and food security.
4. CONCLUSION

The results show that the random forest-Hampel model is the best model compared with the other
methods used in the analysis. The proposed hybrid model is better in terms of MAE,
MSE, MAPE, SSE, and R-square values than the other methods. The random forest-Hampel model also gives
the best values of the 8 selection criteria: AIC, GCV, HQ, RICE, SCHWARZ, SGMASQ, SHIBATA, and FPE.
It is therefore the model that should be applied for sustainable agriculture
and food security.